arXiv:1509.04265v2 [cs.LG] 16 Sep 2015 


Double Relief with progressive weighting function

Gabriel Prat Masramon and Lluis A. Belanche Munoz 

Faculty of Computer Science 
Polytechnical University of Catalonia 
Barcelona, Spain 

{gprat,belanche}@lsi.upc.edu


15th June 2006


Abstract 

Feature weighting algorithms try to solve a problem of great importance in
machine learning today: the search for a relevance measure for the features
of a given domain. This relevance is primarily used for feature selection,
since feature weighting can be seen as a generalization of it, but it is also
useful for better understanding a problem's domain or for guiding an inductor
in its learning process. The Relief family of algorithms has proven to be very
effective in this task.

In previous work, a new extension was proposed that aimed to improve the
algorithm's performance, and it was shown that in certain cases it improved
the accuracy of the weight estimates. However, it also seemed to be sensitive
to some characteristics of the data. This work presents an improvement of
that extension that aims to make it more robust to problem-specific
characteristics. An experimental design is proposed to test its performance.
The results of the tests indicate that it does increase the robustness of the
previously proposed extension.


1 Overview 

Feature selection is undoubtedly one of the most important problems in machine
learning, pattern recognition and information retrieval, among others. A feature
selection algorithm is a computational solution that is motivated by a certain
definition of relevance. However, the relevance of a feature may have several
definitions depending on the objective pursued.

Feature weighting algorithms, on the other hand, try to estimate relevance
(in the form of weights assigned to the features) rather than making a binary
decision on whether a feature is relevant or not. This is a much harder
problem, but also a more flexible framework from an inductive learning
perspective. Algorithms of this kind are confronted with the down-weighting of
irrelevant features, the up-weighting of relevant ones and the problem of
relevance assignment when redundancy is an issue.

In this work we review Relief, one of the most popular feature weighting
algorithms. The original Relief and some of its variants are presented in
Section 2, drawing heavily on our own earlier material. Next, we revisit a
"double" or feedback extension of the algorithm, first introduced in our
previous work, that takes its own estimates into account in order to improve
general performance. Then a new version of the algorithm is presented in
Section 3 that uses its own estimates in a progressive manner: it initially
behaves like the traditional algorithm and gradually increases the importance
of its estimates, so that at the end it behaves like the "double" version. An
experimental design is presented in Section 4 to test the performance of the
original algorithm against the two proposed ones. Finally, some results and
conclusions are presented.


2 Relief 

Relief is a feature weighting algorithm that does not share one common
characteristic of feature selection and weighting methods. Most of them treat
features individually, assuming conditional independence of the features given
the class. Relief, on the other hand, takes all other features into account
when evaluating a specific feature. Another interesting characteristic of
Relief is that it is aware of contextual information, being able to detect
local correlations of feature values and their ability to discriminate from an
instance of a different class.

The main idea behind Relief is to assign large weights to features that
contribute to separating near instances of different classes and joining near
instances belonging to the same class. The word "near" in the previous sentence
is of crucial importance, since one of the main differences between Relief and
the other methods mentioned is its ability to take local context into account.
Relief does not reward features that separate (join) instances of different
(same) classes in general, but features that do so for near instances.

Fig. 1 shows the original algorithm presented by Kira and Rendell in
[Kira and Rendell, 1992]. We maintain the original notation, which differs
slightly from the one used above, as features (attributes) are now labeled A.
To detect whether a feature is useful for discriminating near instances, the
algorithm selects two nearest neighbors of the current instance $R_i$: one from
the same class, H, called the nearest hit, and one from the other class, M,
called the nearest miss (the original Relief algorithm only dealt with
two-class problems). With these two nearest neighbors it increases the weight
of the feature if it has the same value for both $R_i$ and H and decreases it
otherwise. The opposite occurs with the nearest miss: Relief increases the
weight of a feature if it has different values for $R_i$ and M and decreases it
otherwise.

Input: for each training instance a vector of feature values and the class value
Output: the vector W of estimations of the qualities of features

set all weights W[A] := 0.0;
for i := 1 to m do begin
    randomly select an instance R_i;
    find nearest hit H and nearest miss M;
    for A := 1 to a do
        W[A] := W[A] - diff(A, R_i, H)/m + diff(A, R_i, M)/m;
end;


Figure 1: Pseudo code of the original Relief algorithm


One of the central parts of Relief is the difference function diff, which is
also used to compute the distance between instances, as shown in Eq. (2.1).

$$\delta(I_1, I_2) = \sum_{i=1}^{a} \operatorname{diff}(A_i, I_1, I_2) \tag{2.1}$$


The original definition of diff was a heterogeneous distance metric composed of
the overlap metric in Eq. (2.2) for nominal features and the normalized
Euclidean distance in Eq. (2.3) for linear features, which
[Wilson and Martinez, 1997] called HEOM.

$$\operatorname{diff}(A, I_1, I_2) = \begin{cases} 0 & \text{if } \operatorname{value}(A, I_1) = \operatorname{value}(A, I_2) \\ 1 & \text{otherwise} \end{cases} \tag{2.2}$$

$$\operatorname{diff}(A, I_1, I_2) = \frac{|\operatorname{value}(A, I_1) - \operatorname{value}(A, I_2)|}{\max(A) - \min(A)} \tag{2.3}$$
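
The pseudo code of Fig. 1 together with Eqs. (2.1) to (2.3) can be sketched in
Python as follows. This is only a minimal illustration, not the authors'
implementation: the function names, the `nominal` flag list, the fixed random
seed and the toy dataset are all assumptions made for the example.

```python
import random


def diff(a, x, y, nominal, ranges):
    """Eq. (2.2) for nominal features, Eq. (2.3) for numeric ones."""
    if nominal[a]:
        return 0.0 if x[a] == y[a] else 1.0
    lo, hi = ranges[a]
    return abs(x[a] - y[a]) / (hi - lo) if hi > lo else 0.0


def distance(x, y, nominal, ranges):
    """Eq. (2.1): sum of per-feature differences."""
    return sum(diff(a, x, y, nominal, ranges) for a in range(len(x)))


def relief(X, y, m, nominal, seed=0):
    """Two-class Relief as in Fig. 1; returns the weight vector W."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    # Value range of each numeric feature (unused for nominal ones).
    ranges = [None if nominal[a] else
              (min(row[a] for row in X), max(row[a] for row in X))
              for a in range(n_feat)]
    W = [0.0] * n_feat
    for _ in range(m):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        H = min(hits, key=lambda j: distance(X[i], X[j], nominal, ranges))
        M = min(misses, key=lambda j: distance(X[i], X[j], nominal, ranges))
        for a in range(n_feat):
            W[a] += (diff(a, X[i], X[M], nominal, ranges)
                     - diff(a, X[i], X[H], nominal, ranges)) / m
    return W


# Hypothetical toy data: feature 0 determines the class, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.1]]
y = [0, 0, 0, 1, 1, 1]
W = relief(X, y, m=20, nominal=[False, False])
```

On this toy data the relevant feature receives a positive weight and the noise
feature a negative one, whichever instances happen to be sampled.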

Normalizing the differences by m guarantees that the weights stay in the range
[-1, 1]. In fact, the algorithm tries to approximate the probability difference
in Eq. (2.5):

\begin{align}
W[A] \approx{} & P(\text{different value of } A \mid \text{nearest instance from different class}) \tag{2.4} \\
 & - P(\text{different value of } A \mid \text{nearest instance from same class}) \tag{2.5}
\end{align}


For a set of instances $\mathcal{I}$ with a set of features $\mathcal{F}$, this
algorithm has cost $O(m \times |\mathcal{I}| \times |\mathcal{F}|)$. The main
loop runs over m instances, and for each of them the distance to all other
instances has to be computed, so the distance $\delta$ is evaluated
$O(m \times |\mathcal{I}|)$ times. From Eq. (2.1) it is easy to see that each
evaluation costs $O(|\mathcal{F}|)$, which gives the total complexity
$O(m \times |\mathcal{I}| \times |\mathcal{F}|)$. As m is a user-defined
parameter, we can control the cost of Relief to some extent, trading off
accuracy of estimation (large m) against low complexity (small m). However, m
can never be greater than $|\mathcal{I}|$.


2.1 Extensions of Relief


The first modification proposed to the algorithm is to make it deterministic
by replacing the outer loop over m randomly chosen instances with a loop over
all instances. This obviously increases the computational cost of the
algorithm, which becomes $O(|\mathcal{I}|^2 \times |\mathcal{F}|)$, but makes
experiments with small datasets more reproducible. Kononenko uses this
simplified version of the algorithm in his paper [Kononenko, 1994] to test his
new extensions to the original Relief. This version is also used by other
authors [Kohavi and John, 1997], where it is given the name Relieved, with the
final d standing for "deterministic".

[Kononenko, 1994] proposes some extensions to the original Relief algorithm in
order to overcome some of its limitations: it could not deal with incomplete
datasets, it was very sensitive to noisy data, and it could only deal with
multi-class problems by splitting the problem into a series of 2-class
problems.

To enable Relief to deal with incomplete datasets, i.e. datasets containing
missing values, a modification of the diff function is needed. The new function
must be capable of calculating the difference between a feature value and a
missing value, and between two missing values, in addition to the difference
between two known values. Kononenko proposed various modifications of this
function in his paper and found that one performed better than the others: the
one used in a version of Relief he called RELIEF-D (not to be confused with
Relieved mentioned above). The difference function used by RELIEF-D is shown in
Eq. (2.6).

$$\operatorname{diff}(A, I_1, I_2) = \begin{cases} 1 - P(\operatorname{value}(A, I_2) \mid \operatorname{class}(I_1)) & \text{if } I_1 \text{ is missing} \\ 1 - \sum_{a \in A} \left[ P(a \mid \operatorname{class}(I_1)) \times P(a \mid \operatorname{class}(I_2)) \right] & \text{if both are missing} \end{cases} \tag{2.6}$$

Now we will focus on giving Relief greater robustness against noise. This
robustness can be achieved by increasing the number of nearest hits and misses
to look at, which mitigates the effect of choosing a neighbor that would not
have been the nearest without the effect of noise. The new algorithm has a
new user-defined parameter k that controls the number of nearest neighbors to
use. In choosing k there is a tradeoff between locality and noise robustness.
[Kononenko, 1994] states that 10 is a good choice for most purposes.

The last limitation was that the algorithm was only designed for 2-class
problems. The straightforward extension to multi-class problems would be to
take as the nearest miss the nearest neighbor belonging to any different class.
This variant of Relief is called Relief-E by Kononenko. He later proposes
another variant that gave better results: take the nearest neighbor (or the k
nearest) from each different class and average their contributions, so as to
keep the contributions of hits and misses symmetric and within the interval
[0,1]. That gives the Relief-F (ReliefF from now on) algorithm shown in Fig. 2.
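
The multi-class update with k neighbors per class can be sketched in Python as
follows. This is an illustrative sketch under stated assumptions, not the
reference implementation: it handles numeric features only, estimates the class
priors P(C) from class frequencies, and weights each miss class by
P(C)/(1 - P(class(R_i))) as in the usual statement of ReliefF. The function
name, the seed and the toy dataset are hypothetical.

```python
import random
from collections import Counter


def relieff(X, y, m, k, seed=0):
    """ReliefF for numeric features; returns the weight vector W."""
    rng = random.Random(seed)
    n, n_feat = len(X), len(X[0])
    # Range of each feature, falling back to 1.0 for constant features.
    span = [max(r[a] for r in X) - min(r[a] for r in X) or 1.0
            for a in range(n_feat)]
    prior = {c: cnt / n for c, cnt in Counter(y).items()}

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / span[a]

    def nearest(i, cls):
        """Indices of the k instances of class `cls` nearest to instance i."""
        cand = [j for j in range(n) if j != i and y[j] == cls]
        cand.sort(key=lambda j: sum(diff(a, i, j) for a in range(n_feat)))
        return cand[:k]

    W = [0.0] * n_feat
    for _ in range(m):
        i = rng.randrange(n)
        hits = nearest(i, y[i])
        misses = {c: nearest(i, c) for c in prior if c != y[i]}
        for a in range(n_feat):
            W[a] -= sum(diff(a, i, j) for j in hits) / (m * k)
            for c, near in misses.items():
                scale = prior[c] / (1.0 - prior[y[i]])
                W[a] += scale * sum(diff(a, i, j) for j in near) / (m * k)
    return W


# Hypothetical three-class toy data: feature 0 is relevant, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.5, 0.5], [0.55, 0.1], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 1, 1, 2, 2]
W = relieff(X, y, m=30, k=1)
```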

Input: for each training instance a vector of feature values and the class value
Output: the vector W of estimations of the qualities of features

set all weights W[A] := 0.0;
for i := 1 to m do begin
    randomly select an instance R_i;
    find k nearest hits H_j;
    for each class C \neq class(R_i) do
        find k nearest misses M_j(C);
    for A := 1 to a do
        W[A] := W[A] - \sum_{j=1}^{k} diff(A, R_i, H_j)/(m \cdot k)
                + \sum_{C \neq class(R_i)} [ P(C)/(1 - P(class(R_i))) \sum_{j=1}^{k} diff(A, R_i, M_j(C)) ]/(m \cdot k);
end;


Figure 2: Pseudo code of the ReliefF algorithm


The relation to impurity functions, specifically to Gini-index gain, can be
seen in [Robnik-Sikonja and Kononenko, 2003] by developing the probability
difference in Eq. (2.5) for the case in which the algorithm uses a large number
of nearest neighbors (i.e., when the selected instance could be any instance
from the set). This version of the algorithm is called myopic ReliefF, as it
loses its context-of-locality property. Rewriting Eq. (2.5) by removing the
neighboring condition and applying Bayes' rule, we obtain Eq. (2.7).

$$W'[A] = \frac{P_{samecl|eqval}\, P_{eqval}}{P_{samecl}} - \frac{(1 - P_{samecl|eqval})\, P_{eqval}}{1 - P_{samecl}} \tag{2.7}$$


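For sampling with replacement we have:

$$P_{samecl} = \sum_{c \in C} P(c)^2$$

$$P_{samecl|eqval} = \sum_{x \in X} \left( \frac{P(x)^2}{\sum_{x \in X} P(x)^2} \times \sum_{c \in C} P(c \mid x)^2 \right)$$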

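Now we can rewrite Eq. (2.7) to obtain the myopic Relief weight estimation:

$$W'[A] = \frac{P_{eqval} \times GG'(A)}{P_{samecl}\,(1 - P_{samecl})} \tag{2.8}$$

where GG'(A) is a modified Gini-index gain of attribute A, as shown in
Eq. (2.9):

$$GG'(A) = \sum_{x \in X} \left( \frac{P(x)^2}{\sum_{x \in X} P(x)^2} \times \sum_{c \in C} P(c \mid x)^2 \right) - \sum_{c \in C} P(c)^2 \tag{2.9}$$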
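
The step from Eq. (2.7) to Eq. (2.8) is a short algebraic rearrangement. The
worked derivation below is a sketch that assumes the sampling-with-replacement
identities given above, under which the modified Gini-index gain of Eq. (2.9)
reads $GG'(A) = P_{samecl|eqval} - P_{samecl}$.

```latex
\begin{align*}
W'[A] &= P_{eqval}\left[\frac{P_{samecl|eqval}}{P_{samecl}}
         - \frac{1 - P_{samecl|eqval}}{1 - P_{samecl}}\right] \\
      &= P_{eqval}\,
         \frac{P_{samecl|eqval}\,(1 - P_{samecl}) - P_{samecl}\,(1 - P_{samecl|eqval})}
              {P_{samecl}\,(1 - P_{samecl})} \\
      &= \frac{P_{eqval}\,\bigl(P_{samecl|eqval} - P_{samecl}\bigr)}
              {P_{samecl}\,(1 - P_{samecl})}
       = \frac{P_{eqval} \times GG'(A)}{P_{samecl}\,(1 - P_{samecl})}
\end{align*}
```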

As we can see, the difference between this modified version and the original
Gini-index gain is that Gini-index gain uses the factor

$$\frac{P(x)}{\sum_{x \in X} P(x)} = P(x)$$

while myopic ReliefF uses

$$\frac{P(x)^2}{\sum_{x \in X} P(x)^2}$$

So the myopic ReliefF of Eq. (2.8) applies a kind of normalization for
multi-valued attributes through the factor $P_{eqval}$. This corrects the bias
of impurity functions towards attributes with many values. Another improvement
over Gini-index gain is that the values of Gini-index gain decrease as the
number of classes increases; the denominator of Eq. (2.8) avoids this
undesirable behavior.


3 Double Relief 

When more and more irrelevant features are added to a dataset, the distance
calculation of Relief degrades: instances may be considered neighbors when in
fact they are far from each other if the distance is computed with the relevant
features only. In such cases the algorithm may lose its context of locality and
in the end fail to recognize relevant features.

The function $\operatorname{diff}(A_i, I_1, I_2)$ calculates the difference
between the values of feature $A_i$ for two instances $I_1$ and $I_2$. The sum
of differences over all features is used to determine the distance between two
instances in the nearest hit and miss calculation (see Eq. (2.1)).

As with the k-nearest neighbors classification algorithm (kNN), many weighting
schemes can be used that assign different weights to the features in the
calculation of the distance between instances (see Eq. (3.1)).

$$\delta'(I_1, I_2) = \sum_{i=1}^{a} w(A_i)\, \operatorname{diff}(A_i, I_1, I_2) \tag{3.1}$$

In the same way that Relief's estimates of feature quality have been used
successfully as weights for the distance calculation of kNN in
[Wettschereck et al., 1997], we could use the estimates from the previous
iteration to compute the distance between instances while searching for the
nearest hits and misses. We will refer to this version of ReliefF as double
ReliefF, or dReliefF for short.
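
The effect of weighting the distance with the current estimates can be
illustrated with a small Python sketch. The points, the weight values and the
helper names are hypothetical and chosen only to show how an irrelevant feature
can change which instance counts as nearest. The sketch also assumes
non-negative weights; how negative estimates are treated in the distance is not
specified here.

```python
def weighted_distance(x, y, weights, spans):
    """Eq. (3.1): per-feature differences scaled by the current weights."""
    return sum(w * abs(xa - ya) / s
               for w, xa, ya, s in zip(weights, x, y, spans))


# Feature 0 is relevant, feature 1 is irrelevant (hypothetical values).
query = [0.0, 0.0]
candidates = {
    "near_on_relevant": [0.1, 0.9],
    "near_on_noise": [0.6, 0.0],
}
spans = [1.0, 1.0]


def nearest(weights):
    """Name of the candidate closest to the query under the given weights."""
    return min(candidates,
               key=lambda name: weighted_distance(query, candidates[name],
                                                  weights, spans))


unweighted = nearest([1.0, 1.0])   # plain ReliefF distance, Eq. (2.1)
weighted = nearest([0.9, 0.05])    # hypothetical current estimates W
```

With equal weights the instance that is close only on the irrelevant feature
wins; once the irrelevant feature is down-weighted, the neighbor that is close
on the relevant feature is selected instead.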

3.1 Progressively weighted double Relief 

The problem with using the weight estimates is that in early iterations they
may be too biased towards the first instances and may be far from the optimal
weights. So, for small t, the estimate of $W[A_i]$ after t iterations may be
very different from its final value.

What we want is to begin the distance calculation without using the weight
estimates and then, as Relief's weight estimates become more accurate (because
more instances have been taken into account), increase the importance of these
weights in the distance calculation. Consider a distance calculation like the
one in Eq. (3.2).

$$\delta(I_1, I_2) = \sum_{i=1}^{a} f(W[A_i]_t, t)\, \operatorname{diff}(A_i, I_1, I_2) \tag{3.2}$$

We would like a function $f : \mathbb{R} \times (0, \infty) \to \mathbb{R}$
such that:

• $f(w, t)$ is monotonic with respect to t

• $f$ is continuous

• $f(w, 0) = 1$

• $f(w, \infty) = w$

One such function is the one in Eq. (3.3). We will refer to the version of
ReliefF using this distance equation as progressively weighted double Relief,
or pdReliefF for short.

$$f(w, t) = \frac{(w - 1)\, c(t)}{c(t) + s} + 1 \tag{3.3}$$

Here s is a control parameter that determines the steepness and final value
of the curve described by f (see Fig. 3) and c(t) is a function of the
iteration number (e.g. c(t) = t). Another desirable property for our function
would be that it always gives the same results regardless of the number of
iterations. In other words, if m is the total number of iterations, we would
like f(w, m) to be the same value whatever the value of m. To achieve that,
c(t) must also depend on the total number of iterations m, so that the
steepness of the function decreases as the total number of iterations
increases. A possible definition of c(t) is shown in Eq. (3.4).

$$c(t) = (t/m)^{\alpha} \tag{3.4}$$


Fig. 3 shows how f varies the influence of different weights (even an
unrealistic one that is greater than 1) as iterations go on. With the function
of Eq. (3.3), low values of s make it converge in the first few iterations and
then stabilize near w, whereas for high values of s its value remains near 1
until the end. To choose the parameter values we can compare the areas above
and below the curve. Normal ReliefF can be seen as a particular case where
f(w, t) = 1, having maximum area, and dReliefF as another particular case with
f(w, t) = w, having minimum area. We want to choose the parameters to lie in
between the two. Specifically, we could choose them so that the area between
the curve and w is 1/3 of the area between 1 and w. This requires solving
Eq. (3.5).

$$\frac{\int_0^m f(w, t)\, dt - \int_0^m w\, dt}{\int_0^m 1\, dt - \int_0^m w\, dt} = \frac{1}{3} \tag{3.5}$$


One combination of parameters that solves the equation is $\alpha = 2$ and
$s = 0.0633657 \approx 0.06$. Fig. 4 shows graphically that with these values
the weighting factors stay near 1 during the early iterations and then take
values near the weights themselves. These values have been used in our
experiments.
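
The parameter choice can be checked numerically. The following Python sketch
assumes that Eq. (3.3) has the form f(w, t) = (w - 1) c(t) / (c(t) + s) + 1 and
that the integrals of Eq. (3.5) run over the iterations from 0 to m; the
function names and the midpoint-rule integration are choices made for this
example only.

```python
def c(t, m, alpha=2.0):
    """Eq. (3.4): iteration counter rescaled to [0, 1]."""
    return (t / m) ** alpha


def f(w, t, m, s=0.0633657, alpha=2.0):
    """Eq. (3.3): equals 1 at t = 0 and moves towards w as t grows."""
    ct = c(t, m, alpha)
    return (w - 1.0) * ct / (ct + s) + 1.0


def area_ratio(w, m, s=0.0633657, alpha=2.0, steps=100_000):
    """Left-hand side of Eq. (3.5), integrated with the midpoint rule."""
    h = m / steps
    integral_f = sum(f(w, (i + 0.5) * h, m, s, alpha)
                     for i in range(steps)) * h
    return (integral_f - w * m) / (m - w * m)


ratio_small_m = area_ratio(w=0.2, m=100)
ratio_large_m = area_ratio(w=-0.5, m=5000)
```

Under these assumptions both ratios come out very close to 1/3, independently
of the weight w and of the number of iterations m, which is the property that
motivated making c(t) depend on m.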


4 Experimental design 

4.1 Objective 

The above sections present three algorithms: 

ReliefF The algorithm presented by Kononenko in [Kononenko, 1994]

dReliefF The above algorithm using its own partial weights to weight the
attributes in the distance calculation

pdReliefF The above using a function to progressively increase the effect of
the weights in the distance calculation

The objective of the experiments presented here is to compare the performance
of the three algorithms with respect to the number of irrelevant attributes.
The hypothesis is that the performance of the unmodified algorithm will be more
affected by an increase in the number of irrelevant attributes, due to their
influence on the distance calculation.

4.2 Factors 

As stated before, the key factor of the experiments is the ratio of irrelevant
attributes, but there are some nuisance factors that affect the experiments'
results. The factors considered in the experiments are:

• Problem to solve

• Numeric vs. categorical attributes

• Number of relevant attributes

• Number of irrelevant attributes

• Data randomization

The factor with the greatest impact on performance results is the problem to be
solved, and it is also the most difficult one to reduce. To eliminate its
influence, all possible problems would have to be tried, which is obviously
impossible. Another factor that can clearly affect performance is the type of
the attributes, as Relief uses a heterogeneous function for distance
calculation which depends on whether the attributes are numeric or categorical.
So, to reduce the effect of these two factors the same experiments will be run
on six different problems, three with numeric attributes and three with
categorical ones. All the problems tested are artificial, so that enough is
known about the data for the evaluation of the weighting not to depend on the
performance of a classifier.

Ranges have to be chosen for each factor. There has to be at least one
relevant attribute and one irrelevant one in order to check whether the
algorithm is capable of distinguishing them, so both will start at 1 in our
experiments. The number of irrelevant attributes will depend on the number of
relevant ones, in order to test with the same percentages of irrelevant
attributes for each number of relevant attributes. A reasonable choice is to
have at most twice as many irrelevant attributes as relevant ones.

The upper bound for the number of relevant attributes depends on the number of
instances to be generated. It is interesting to test the algorithms with a wide
range of attribute-to-instance ratios. We arbitrarily set the number of
instances generated to 100. With that number of instances, it would be
interesting to have at most 150 features, so that the ratio of instances to
attributes does not get too low. If we want the total number of features to
stay at or below 150 with up to twice as many irrelevant attributes as relevant
ones, we have to set the upper bound on the number of relevant attributes to 50.

Finally, 10 different sets of data will be generated for each combination of
the other factors to reduce the possible effect of randomly generating a
pathological set of data.

4.3 Design

Here we have to decide which of all the possible combinations of factors will
be tried in the experiments. The best way to reduce or eliminate the
contribution of each factor to experimental error is to treat them as blocking
factors, that is, to create homogeneous blocks in which the nuisance factors
are kept constant while the target factor takes all its possible values. When
full blocking is not possible because of limited resources, a random subset of
each block can be run.

With the ranges described above, there are a total of 3 × I × N × (N + 1)
different factor combinations for each problem, as shown in Eq. (4.1), where N
is the maximum number of relevant attributes, I is the number of iterations
(i.e. random dataset generations) for each combination of relevant and
irrelevant attribute numbers, and the factor 3 accounts for the three
algorithms.

$$3 \times \sum_{r=1}^{N} \sum_{j=1}^{2r} I = 3 \times I \times N \times (N + 1) \tag{4.1}$$

With N = 50 and I = 10, that gives a total of 76,500 different combinations for
each problem. This number is small enough for all combinations to be run, so
the experimental design will be a full blocking design, as shown
algorithmically in Fig. 5.

for each problem in problems do begin
    for impAtts := 1 to 50 do begin
        for irrAtts := 1 to impAtts * 2 do begin
            for iteration := 1 to 10 do begin
                execute problem with each algorithm;
            end;
        end;
    end;
end;


Figure 5: Pseudo code of the experimental design
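
The number of runs per problem can be checked by enumerating the loops of
Fig. 5 directly. In this short Python sketch the constant names are
illustrative, and the closed form is the one obtained by summing 2r irrelevant
attribute counts over r = 1..N.

```python
N_RELEVANT_MAX = 50   # upper bound on the number of relevant attributes
ITERATIONS = 10       # random datasets per attribute combination
ALGORITHMS = ("ReliefF", "dReliefF", "pdReliefF")

# Enumerate every (relevant, irrelevant, iteration, algorithm) combination.
runs = sum(
    1
    for relevant in range(1, N_RELEVANT_MAX + 1)
    for irrelevant in range(1, 2 * relevant + 1)
    for _ in range(ITERATIONS)
    for _ in ALGORITHMS
)
closed_form = (len(ALGORITHMS) * ITERATIONS
               * N_RELEVANT_MAX * (N_RELEVANT_MAX + 1))
```

Both counts agree with the 76,500 combinations per problem stated in the text.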


4.4 Problems 

4.4.1 RDG1NamedContinuous

A data generator that produces random data with numeric attributes by building
a decision list. The decision list consists of rules of the form
$c_x := \bigwedge_{i=1}^{n} t_i$, where each $t_i$ is an inequality term (i.e.
x < y or x > y) between some attribute and a random value. For each rule, the
number of terms n is a random number in the range [1..10]. An example set of
rules is shown in Eq. (4.2).


RULE 0: c_0 := a_1 < 0.986 ∧ a_9 >= 0.65
RULE 1: c_1 := a_1 < 0.95 ∧ a_2 < 0.129
RULE 2: c_2 := a_1 >= 0.562


(4.2)


Instances are generated randomly one by one. The class is determined by the
first rule that is true for the current instance. If the decision list fails to
classify the current instance, a new rule matching this instance is generated
and added to the decision list. Irrelevant attributes are generated randomly in
the range [0,1].

4.4.2 RandomRBFRandRedl 

Radial basis functions (RBF) are functions whose characteristic feature is that
their response decreases (or increases) monotonically with distance from a
central point. There are different formulas to describe the specific shape of
the function, and they usually have parameters to control the center and the
distance scale. In this particular case, the function f(x) used is the
Gaussian, which is described by Eq. (4.3) and plotted in Fig. 6. Its parameters
are its mean $\mu$ and its standard deviation $\sigma$. A Gaussian RBF
decreases monotonically with distance from the center.



(4.3) 



Figure 6: Plot of function f(x) with $\mu = 0$ and $\sigma = 1$


RandomRBF data is generated by first creating a random set of centers
for each class. Each center is randomly assigned a weight, a central point per
attribute, and a standard deviation. To generate new instances, a center is
chosen at random taking the weights of each center into consideration. Attribute
values are randomly generated and offset from the center, where the overall
vector has been scaled so that its length equals a value sampled randomly from
the Gaussian distribution of the center. The particular center chosen determines
the class of the instance. RandomRBF data contains only numeric attributes, as
it is non-trivial to include nominal values. Irrelevant attributes are generated
following the same Gaussian distribution for some random centers and standard
deviations.

4.4.3 NonMonotonic 

Let $r_a$ be a random value in the range [0..1] that acts as a weighting factor
for attribute a. For each instance i, generate a random value $r_i$ in the
range [0..N], where N is the number of important attributes. The value of
attribute a for instance i is given by Eq. (4.4):

$$\operatorname{value}(a, i) = \begin{cases} r_a \times r_i & \text{if } i \bmod 2 \neq 0 \\ r_a \times \ldots & \text{if } i \bmod 2 = 0 \end{cases} \tag{4.4}$$

The class for instance i is the integer part of $r_i$. Irrelevant attributes
are created randomly following a uniform distribution in the range [0,1].

4.4.4 MajorityN 

Creates n relevant binary attributes and i irrelevant attributes. The class
attribute is 1 when the instance has a majority of 1s in the relevant
attributes and 0 otherwise.
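
A generator for this problem could look like the following Python sketch. The
function name and signature are hypothetical, and two details are assumptions
that the description leaves open: irrelevant attributes are drawn as uniform
random bits, and a tie among the relevant attributes yields class 0.

```python
import random


def majority_n(n_relevant, n_irrelevant, n_instances, seed=0):
    """Return rows of binary attributes (relevant first) and class labels."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_instances):
        relevant = [rng.randint(0, 1) for _ in range(n_relevant)]
        irrelevant = [rng.randint(0, 1) for _ in range(n_irrelevant)]
        X.append(relevant + irrelevant)
        # Class 1 only for a strict majority of 1s among relevant attributes.
        y.append(1 if 2 * sum(relevant) > n_relevant else 0)
    return X, y


X, y = majority_n(n_relevant=5, n_irrelevant=3, n_instances=100)
```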

4.4.5 ModuloP 

Each Modulo-p problem is described by a set R of relevant attributes, with
|R| = n, and i irrelevant attributes, both with integer values in the range
[0, p). The class c is defined as in Eq. (4.5):

$$c = \left( \sum_{r \in R} r \right) \bmod p \tag{4.5}$$
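
Eq. (4.5) translates directly into a data generator. The Python sketch below is
illustrative only: the function name and signature are hypothetical, and
attribute values are assumed to be drawn uniformly from [0, p).

```python
import random


def modulo_p(p, n_relevant, n_irrelevant, n_instances, seed=0):
    """Return rows of integer attributes (relevant first) and class labels."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_instances):
        relevant = [rng.randrange(p) for _ in range(n_relevant)]
        irrelevant = [rng.randrange(p) for _ in range(n_irrelevant)]
        X.append(relevant + irrelevant)
        y.append(sum(relevant) % p)  # Eq. (4.5)
    return X, y


X, y = modulo_p(p=3, n_relevant=4, n_irrelevant=2, n_instances=50)
```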


4.4.6 RDG1NamedCategoric

The same data generator as for RDG1NamedContinuous, but this time generating
boolean attributes instead of numeric ones, so the rules are now boolean
predicates.


5 Results 

In this section the results of the experiments described above are presented.
Six plots are presented in Fig. 7. To understand clearly what the axes
represent, some notation has to be introduced. Let
$\mathcal{R} = \{r_1, r_2, \ldots, r_n\}$ be the set of relevant attributes and
$\mathcal{I} = \{i_1, i_2, \ldots, i_m\}$ the set of irrelevant ones, with
$|\mathcal{R}| = n$ and $|\mathcal{I}| = m$, and let w(a) be the weight
assigned by the algorithm to attribute a. The x-axis represents the total
number of attributes (m + n) and the y-axis the separability s (i.e. the
maximum weight assigned to a relevant attribute minus the maximum weight
assigned to an irrelevant one). The formulas are shown in Eq. (5.1).

$$\begin{aligned} \text{x-axis:}\quad & m + n \\ \text{y-axis:}\quad & s = \left( \max_{a \in \mathcal{R}} w(a) \right) - \left( \max_{a \in \mathcal{I}} w(a) \right) \end{aligned} \tag{5.1}$$
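
The separability measure can be written as a short Python function. The
function name, the representation of the relevant set as a set of attribute
indices, and the example weights are assumptions made for illustration; the
numbers are not results from the experiments.

```python
def separability(weights, relevant):
    """Max weight over relevant attributes minus max over irrelevant ones."""
    rel = [w for a, w in enumerate(weights) if a in relevant]
    irr = [w for a, w in enumerate(weights) if a not in relevant]
    return max(rel) - max(irr)


# Hypothetical weight vectors; attributes 0 and 1 are the relevant ones.
s_good = separability([0.6, 0.4, 0.1, -0.2], relevant={0, 1})
s_bad = separability([0.1, 0.0, 0.3, -0.2], relevant={0, 1})
```

A positive value means that the best relevant attribute is ranked above every
irrelevant one; a negative value means that some irrelevant attribute received
the largest weight.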

In order to accentuate the global differences between the three algorithms,
six more plots are presented with the accumulated results for the y-axis;
Fig. 8 shows these results. The x-axis keeps the same definition as before,
while the y-axis is the accumulated value of the separability, so the y-axis
value at point $x_n$ is given by Eq. (5.2), where $s_i$ is the separability
defined in Eq. (5.1) at point $x_i$.

$$\text{y-axis:}\quad \sum_{i=0}^{n} s_i \tag{5.2}$$

With this new axis definition, the slope of the curve indicates positive or
negative separability. If the curve descends at some point, the separability
there was negative; if it ascends, the separability was positive. The steepness
of the slope indicates the magnitude of the separability (whether positive or
negative). Finally, the separation between the curves of the algorithms shows
the accumulated difference of separabilities. If at the end one algorithm is
above another, its accumulated (and hence mean) separability is greater, so one
can conclude that on average this algorithm outperforms the other.
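
How the accumulated curves are read can be illustrated with a few lines of
Python. The separability values below are made up for the example and are not
taken from the experiments; the dictionary layout is likewise an assumption.

```python
from itertools import accumulate

# Hypothetical separabilities at successive x-axis points, per algorithm.
separabilities = {
    "ReliefF": [0.30, 0.25, -0.05, 0.10],
    "dReliefF": [0.40, -0.10, -0.20, 0.05],
}

# Eq. (5.2): running sum of the separability along the x-axis.
accumulated = {name: list(accumulate(s)) for name, s in separabilities.items()}

# The algorithm whose curve ends highest has the greatest mean separability.
best = max(accumulated, key=lambda name: accumulated[name][-1])
```

In this made-up example the ReliefF curve dips at the third point, where its
separability is negative, but still ends above the dReliefF curve.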



6 Conclusions 

Looking at the results above, none of the three algorithms is clearly better
than the others for the chosen set of problems. In the first set of plots,
which have separability on the y-axis, the curves of the three algorithms are
almost the same; only when there are few attributes does dReliefF seem to
behave differently.

An anomaly is the random RBF problem, where dReliefF is clearly worse. In fact,
except for the majority problem, dReliefF is always the worst algorithm, and
even there it is not significantly better. A difference between dReliefF and
the other two algorithms is that it uses the calculated weights as distance
weighting factors starting at the first iteration of the algorithm. That may
well cause dReliefF to get stuck in a local minimum found in those first
iterations, because the distance function it is using does not take some of the
relevant variables into account. In the section above where pdReliefF is
introduced, we stated the hypothesis that using the weight estimates from the
first iteration may decrease performance because these estimates may be too
biased towards the first instances and so may be far from the optimal weights.
The results now help support this hypothesis. That could also explain why the
behavior of dReliefF differs from the others when few attributes are evaluated,
as opposed to when more attributes are present. When there are few attributes
to calculate the distance with, a mistake in choosing their weighting factors
makes big changes in the results, so problems with few attributes are more
sensitive to wrong distance calculations and cause dReliefF to have either much
higher or much lower performance depending on how close the early weights are
to the real optimal weights. If the first instances seen by the algorithm are
not representative of the whole set, for example because they share some common
characteristic that is rare among other instances, then the weights used will
be biased; on the other hand, if these first instances give more accurate
weight approximations, then it is possible for dReliefF to work better than the
rest.

Figure 7: Separability versus total number of attributes for the three
algorithms. Panels: (e) ModuloP, (f) RDG1NamedCategorical.

Figure 8: Accumulated separability versus total number of attributes for the
three algorithms. Panels: (a) RDG1NamedContinuous, (b) RandomRBFRandRed,
(c) NonMonotonic, (d) MajorityN, (e) ModuloP, (f) RDG1NamedCategorical.

There is another characteristic of the results worth pointing out. In the
second set of plots, where differences among the algorithms stand out more
clearly, one can see differences between the behavior of the normal version of
the algorithm and the modified ones. In these plots, two parallel curves for
the separability of two algorithms indicate that their performance evolves in
the same way, while divergent curves indicate that the performance of one of
them increases (decreases) more than the other. With this in mind, the results
show that for the first two problems with numeric attributes the performance of
dReliefF decreases very quickly, normal ReliefF is the best of the three, and
pdReliefF is close to it, though its performance also decreases faster than
that of normal ReliefF. Results for NonMonotonic are not clear, as separability
for that particular problem stays very high for any number of attributes and
the three algorithms perform almost identically. Some modifications could be
applied to the generation of the problem to make it more difficult for ReliefF
to discriminate the attributes' relevance (e.g. adding more noise to the
relevant ones) and to compare the performance degradation of the three
algorithms. The odd thing is that, contrary to what happens with the numeric
problems, in the categorical ones the algorithm that suffers the least
performance decrease is dReliefF, followed by pdReliefF.

So the final conclusion from these experimental results must be that, although
the performance of the three algorithms is frequently almost the same, the new
algorithm pdReliefF seems to always lie between the other two, staying quite
close to the better of them, while the other two are better or worse depending
on the problem type, perhaps depending on whether the attributes are numeric or
categorical. Also, dReliefF is very sensitive to early errors in the weight
approximation of ReliefF, so it must be used carefully.

As future work, more problems could be tested, and specific experiments
should be conducted to examine more deeply the hypothesis that the different
versions of ReliefF perform differently on problems with numeric or categorical
attributes. Some tests on real data should also be done using different
classifiers, to contrast them with the results on artificially generated data.